139 research outputs found

Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable

Author: Blelloch G. E.
Cormen T. H.
Da Zheng D. M.
Dasari N. S.
Gonzalez J. E.
Greenlaw R.
Karp R. M.
Low Y.
Maon Y.
Ramachandran V.
Shiloach Y.
Zhou W.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 03/07/2019

There has been significant recent interest in parallel graph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature reports results on much smaller graphs, and the results for the Hyperlink graph use distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory. This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly-available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically-efficient parallel algorithms for 20 important graph problems. We also present the optimizations and techniques that we used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that the running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We have made the implementations developed in this work publicly-available as the Graph-Based Benchmark Suite (GBBS). Comment: This is the full version of the paper appearing in the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 201

arXiv.org e-Print Archive

Crossref

DSpace@MIT

A Sound and Complete Abstraction for Reasoning about Parallel Prefix Sums

Author: Blelloch G. E.
Harris M.
Pierce B. C.
Sklansky J.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2014

Prefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively parallel graphics processing units (GPUs). Because they lie at the heart of many GPU-accelerated applications, the correctness of prefix sum implementations is of prime importance. We introduce a novel abstraction, the interval of summations, that allows scalable reasoning about implementations of prefix sums. We present this abstraction as a monoid, and prove a soundness and completeness result showing that a generic sequential prefix sum implementation is correct for an array of length n if and only if it computes the correct result for a specific test case when instantiated with the interval of summations monoid. This allows correctness to be established by running a single test where the in
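
As a rough illustration of what "a generic prefix sum over a monoid" means, the sketch below defines a sequential reference scan and a tree-structured scan that are both parameterized by an associative operator, and checks one against the other. The function names `inclusive_prefix_sum` and `tree_scan` are hypothetical, and the operators used here (integer addition, string concatenation) are ordinary monoids chosen for the example. This is not the paper's interval-of-summations monoid, whose definition is not given in the abstract.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def inclusive_prefix_sum(xs: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Sequential reference scan: out[i] = xs[0] op xs[1] op ... op xs[i]."""
    out: List[T] = []
    for x in xs:
        # The first element is copied through; later elements are combined
        # with the running total on the left, preserving operand order.
        out.append(x if not out else op(out[-1], x))
    return out


def tree_scan(xs: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Inclusive scan with the pair-contract-expand structure used by many
    parallel scans (run sequentially here). Requires op to be associative."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    # Contract: combine adjacent pairs (a trailing odd element is left out).
    pairs = [op(xs[i], xs[i + 1]) for i in range(0, n - 1, 2)]
    # Recurse: scanned[j] is the combination of xs[0 .. 2j+1].
    scanned = tree_scan(pairs, op)
    # Expand: odd positions read the recursive result directly; even
    # positions extend the previous pair total by one element.
    out: List[T] = []
    for i in range(n):
        if i == 0:
            out.append(xs[0])
        elif i % 2 == 1:
            out.append(scanned[i // 2])
        else:
            out.append(op(scanned[i // 2 - 1], xs[i]))
    return out


if __name__ == "__main__":
    concat = lambda a, b: a + b
    # String concatenation is associative but not commutative, so it also
    # catches implementations that reorder operands.
    print(tree_scan(list("abcde"), concat))
```

Testing with a non-commutative operator is a common sanity check, but it only samples a few inputs; the abstract's point is that a suitably chosen monoid lets a single test case establish correctness for a given length.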

CiteSeerX

Crossref

Spiral - Imperial College Digital Repository

Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model

Author: Acar U. A.
Acar Umut A.
Agrawal Kunal
Akhremtsev Yaroslav
Arora N. S.
Ben-David Naama
Blelloch Guy E.
Blumofe Robert D.
Cole Richard
Dhulipala Laxman
Gil J.
Goodrich Michael T.
Gustedt Jens
Guy
Miller G.L.
Nievergelt Jürg
Rajasekaran S.
Valiant L. G.
Vishkin Uzi
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 24/06/2020

In this paper we develop optimal algorithms in the binary-forking model for a variety of fundamental problems, including sorting, semisorting, list ranking, tree contraction, range minima, and ordered set union, intersection and difference. In the binary-forking model, tasks can only fork into two child tasks, but can do so recursively and asynchronously. The tasks share memory, supporting reads, writes and test-and-sets. Costs are measured in terms of work (total number of instructions), and span (longest dependence chain). The binary-forking model is meant to capture both algorithm performance and algorithm-design considerations on many existing multithreaded languages, which are also asynchronous and rely on binary forks either explicitly or under the covers. In contrast to the widely studied PRAM model, it does not assume arbitrary-way forks nor synchronous operations, both of which are hard to implement in modern hardware. While optimal PRAM algorithms are known for the problems studied herein, it turns out that arbitrary-way forking and strict synchronization are powerful, if unrealistic, capabilities. Natural simulations of these PRAM algorithms in the binary-forking model (i.e., implementations in existing parallel languages) incur an $\Omega(\log n)$ overhead in span. This paper explores techniques for designing optimal algorithms when limited to binary forking and assuming asynchrony. All algorithms described in this paper are the first algorithms with optimal work and span in the binary-forking model. Most of the algorithms are simple. Many are randomized.
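
To make the work and span measures concrete, here is a minimal sketch that simulates a binary fork-join sum sequentially and counts both quantities. The function name `fork_join_sum` and the cost accounting (one unit per leaf and one unit per fork/join node) are assumptions made for illustration; this is not one of the paper's algorithms.

```python
from typing import List, Tuple


def fork_join_sum(xs: List[int], lo: int, hi: int) -> Tuple[int, int, int]:
    """Sum xs[lo:hi] (non-empty) by recursive binary forking.

    Returns (total, work, span) under an assumed unit-cost accounting:
    each leaf costs 1 and each fork/join node costs 1.
    """
    if hi - lo == 1:
        return xs[lo], 1, 1
    mid = (lo + hi) // 2
    # A task may fork only two children; wider fan-out has to be built
    # from a tree of such forks.
    left_total, left_work, left_span = fork_join_sum(xs, lo, mid)
    right_total, right_work, right_span = fork_join_sum(xs, mid, hi)
    total = left_total + right_total
    work = left_work + right_work + 1          # all operations performed
    span = max(left_span, right_span) + 1      # longest dependence chain
    return total, work, span


if __name__ == "__main__":
    total, work, span = fork_join_sum(list(range(8)), 0, 8)
    print(total, work, span)  # prints "28 15 4"
```

With n leaves the simulated work grows linearly in n while the span grows with the depth of the fork tree, about log2(n) + 1 here. Reaching n parallel tasks by binary forks takes a tree of that depth, which gives some intuition for why simulating an arbitrary-way fork costs a logarithmic factor in span.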

arXiv.org e-Print Archive

Crossref

Efficient computation of hashes

Author: Bertoni G
Blelloch G E
Blum M
Coron J S
Damgard I B
Hall E
Peter R Hobson
Raul H C Lopes
Ristenpart T
Sarkar P
Virginia N L Franqueira
Publication venue: 'IOP Publishing'
Publication date: 11/06/2014

The sequential computation of hashes, which lies at the core of many distributed storage systems and is found, for example, in grid services, can hinder efficiency in service quality and even pose security challenges that can only be addressed by the use of parallel hash tree modes. The main contributions of this paper are, first, the identification of several efficiency and security challenges posed by the use of sequential hash computation based on the Merkle-Damgård engine and, second, a discussion of alternatives for the parallel computation of hash trees, together with a prototype for a new parallel implementation of the Keccak function, the SHA-3 winner.
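
The idea of a hash tree can be sketched as follows: the input is split into chunks, each chunk is hashed independently (this is the step that could run in parallel), and the digests are combined pairwise up to a single root. The helper names `leaf_hashes` and `tree_root`, the chunk size, and the one-byte leaf/node prefixes are assumptions made for this example. It uses the standard library's `hashlib.sha3_256` and is not the paper's prototype or any standardized tree hashing mode.

```python
import hashlib
from typing import List


def leaf_hashes(data: bytes, chunk_size: int) -> List[bytes]:
    """Hash each chunk independently; these calls do not depend on each
    other, so they could be distributed across cores."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]  # treat empty input as a single empty chunk
    # Assumed convention: prefix 0x00 marks a leaf, 0x01 an inner node, so
    # a leaf digest cannot be confused with an inner-node digest.
    return [hashlib.sha3_256(b"\x00" + c).digest() for c in chunks]


def tree_root(data: bytes, chunk_size: int = 1024) -> bytes:
    """Combine leaf digests pairwise until a single root digest remains."""
    level = leaf_hashes(data, chunk_size)
    while len(level) > 1:
        next_level: List[bytes] = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                node = hashlib.sha3_256(b"\x01" + level[i] + level[i + 1])
                next_level.append(node.digest())
            else:
                # An unpaired digest is promoted unchanged to the next level.
                next_level.append(level[i])
        level = next_level
    return level[0]


if __name__ == "__main__":
    print(tree_root(b"example payload" * 1000).hex())
```

Note that the root of such a tree differs from the plain SHA3-256 digest of the same bytes, so both sides of an exchange have to agree on the tree layout and chunk size.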

CLoK

Crossref

Kent Academic Repository

Brunel University Research Archive

UDORA - University of Derby Online Research Archive

The Parallel Persistent Memory Model

Author: Berryhill R.
Blelloch G. E.
Buettner M.
Chauhan H.
Herlihy M.
JaJa J.
Lee S. K.
Meena J. S.
Nawab F.
Pelley S.
Woude J. Van Der
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 13/06/2018

We consider a parallel computational model that consists of $P$ processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault with bounded probability, and possibly restart. On faulting all processor state and local ephemeral memory are lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. Within the model we develop a framework for developing locality efficient parallel algorithms that are resilient to failures. There are several challenges, including the need to recover from failures, the desire to do this in an asynchronous setting (i.e., not blocking other processors when one fails), and the need for synchronization primitives that are robust to failures. We describe approaches to solve these challenges based on breaking computations into what we call capsules, which have certain properties, and developing a work-stealing scheduler that functions properly within the context of failures. The scheduler guarantees a time bound of $O(W/P_A + D(P/P_A) \lceil\log_{1/f} W\rceil)$ in expectation, where $W$ and $D$ are the work and depth of the computation (in the absence of failures), $P_A$ is the average number of processors available during the computation, and $f \le 1/2$ is the probability that a capsule fails. Within the model and using the proposed methods, we develop efficient algorithms for parallel sorting and other primitives. Comment: This paper is the full version of a paper at SPAA 2018 with the same name
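
To get a feel for how the terms of the bound $O(W/P_A + D(P/P_A) \lceil\log_{1/f} W\rceil)$ trade off, the sketch below evaluates the expression inside the big-O for given parameters. The function name `scheduler_bound_terms` and the sample numbers are hypothetical, and the hidden constant factors are ignored, so the result is only a relative indicator and not a predicted running time.

```python
import math


def scheduler_bound_terms(W: float, D: float, P: int, P_A: float, f: float) -> float:
    """Evaluate W/P_A + D * (P/P_A) * ceil(log_{1/f} W), constants ignored.

    W, D : work and depth of the computation without failures
    P    : number of processors
    P_A  : average number of processors available
    f    : probability that a capsule fails, required to satisfy 0 < f <= 1/2
    """
    if not (0.0 < f <= 0.5):
        raise ValueError("f must satisfy 0 < f <= 1/2")
    if W < 1 or P_A <= 0:
        raise ValueError("W must be at least 1 and P_A must be positive")
    # Change of base: log_{1/f} W = ln W / ln(1/f).
    log_term = math.ceil(math.log(W) / math.log(1.0 / f))
    return W / P_A + D * (P / P_A) * log_term


if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    print(scheduler_bound_terms(W=1e6, D=100, P=64, P_A=50, f=0.5))
```

For these sample values the work term $W/P_A$ dominates; the depth term grows as $f$ approaches 1/2 or as fewer of the $P$ processors are available on average.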

arXiv.org e-Print Archive

Crossref

DSpace@MIT

Validation of Methods to Predict Vibration of a Panel in the Near Field of a Hot Supersonic Rocket Plume

Author: Blelloch P. A.
Bremner P. G.
Hutchings A.
Larsen C. E.
Shah P.
Streett C. L.
Publication date: 06/06/2011

This paper describes the measurement and analysis of surface fluctuating pressure level (FPL) data and vibration data from a plume impingement aero-acoustic and vibration (PIAAV) test to validate NASA's physics-based modeling methods for prediction of panel vibration in the near field of a hot supersonic rocket plume. For this test, reported more fully in a companion paper by Osterholt & Knox at the 26th Aerospace Testing Seminar, 2011, the flexible panel was located 2.4 nozzle diameters from the plume centerline and 4.3 nozzle diameters downstream from the nozzle exit. The FPL loading is analyzed in terms of its auto spectrum, its cross spectrum, its spatial correlation parameters and its statistical properties. The panel vibration data is used to estimate the in-situ damping under plume FPL loading conditions and to validate both finite element analysis (FEA) and statistical energy analysis (SEA) methods for prediction of panel response. An assessment is also made of the effects of non-linearity in the panel elasticity.

NASA Technical Reports Server

An Experimental Analysis of Parallel Sorting Algorithms

Author: C. E. Leiserson
B. M. Maggs
G. E. Blelloch
Publication venue: 'Springer Science and Business Media LLC'

Crossref

C**: A large-grain, object-oriented, data-parallel programming language

Author: G. E. Blelloch
T. A. Budd
W. D. Hillis
Publication venue: 'Springer Science and Business Media LLC'

Crossref

A well-separated pairs decomposition algorithm for k-d trees implemented on multi-core architectures

Author: Akl S G
Bentley J L
Blelloch G E
Callahan P B
Cormen T
Har-Peled S
Hoecker A
Ivan D Reid
Knuth D E
McCool M
Moore A
Moore A W
Omohundro S M
Peter R Hobson
Raul H C Lopes
Samet H
Vaidya P M
Publication venue: 'IOP Publishing'
Publication date: 11/06/2014

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. Variations of k-d trees represent a fundamental data structure used in Computational Geometry with numerous applications in science. Examples include particle track fitting in the software of the LHC experiments, and simulations of N-body systems in the study of dynamics of interacting galaxies, particle beam physics, and molecular dynamics in biochemistry. The many-body tree methods devised by Barnes and Hut in the 1980s and the Fast Multipole Method introduced in 1987 by Greengard and Rokhlin use variants of k-d trees to reduce the upper bound on computation time from O(n^2) to O(n log n) and even O(n). We present an algorithm that uses the principle of well-separated pairs decomposition to always produce compressed trees in O(n log n) work. We present and evaluate parallel implementations for the algorithm that can take advantage of multi-core architectures. The Science and Technology Facilities Council, UK

Crossref

Brunel University Research Archive

Parallel Write-Efficient Algorithms and Data Structures for Computational Geometry

Author: Akenine-Möller T.
Atallah M.
Blelloch G. E.
Chen S.
Cormen T. H.
Edelsbrunner H.
Guibas L. J.
Har-Peled S.
JaJa J.
Manolopoulos Y.
Mulmuley K.
Overmars M. H.
Seidel R.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 11/07/2018

In this paper, we design parallel write-efficient geometric algorithms that perform asymptotically fewer writes than standard algorithms for the same problem. This is motivated by emerging non-volatile memory technologies with read performance being close to that of random access memory but writes being significantly more expensive in terms of energy and latency. We design algorithms for planar Delaunay triangulation, $k$-d trees, and static and dynamic augmented trees. Our algorithms are designed in the recently introduced Asymmetric Nested-Parallel Model, which captures the parallel setting in which there is a small symmetric memory where reads and writes are unit cost as well as a large asymmetric memory where writes are $\omega$ times more expensive than reads. In designing these algorithms, we introduce several techniques for obtaining write-efficiency, including DAG tracing, prefix doubling, reconstruction-based rebalancing and $\alpha$-labeling, which we believe will be useful for designing other parallel write-efficient algorithms.

arXiv.org e-Print Archive

Crossref

DSpace@MIT